Acerca del set de datos¶

Los datos estÔn relacionados con campañas de marketing directo (llamadas telefónicas) de una institución bancaria portuguesa. El objetivo de la clasificación es predecir si el cliente suscribirÔ un depósito a plazo (variable y). Los datos estÔn relacionados con campañas de marketing directo de una institución bancaria portuguesa. A menudo, se requería mÔs de un contacto con el mismo cliente, para poder acceder si el producto (depósito bancario a plazo) estaría ('sí') o no ('no') suscrito. Por tanto, el objetivo de la clasificación es predecir si el cliente suscribirÔ (sí/no) un depósito a plazo (variable y).

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_theme(color_codes=True)

import warnings 
warnings.filterwarnings("ignore")
In [2]:
df = pd.read_csv("bank-additional-full.csv", delimiter=";")
In [3]:
pd.set_option("display.max_columns", None)
df.head()
Out[3]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     41188 non-null  float64
 20  y               41188 non-null  object 
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB
In [5]:
df.isnull().sum()
Out[5]:
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

Análisis exploratorio de los datos¶

In [6]:
#Seleccionar datos categoricos 
df_categoricos = df[["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", 
                    "poutcome", "y"]]
df_categoricos.head()
Out[6]:
job marital education default housing loan contact month day_of_week poutcome y
0 housemaid married basic.4y no no no telephone may mon nonexistent no
1 services married high.school unknown no no telephone may mon nonexistent no
2 services married high.school no yes no telephone may mon nonexistent no
3 admin. married basic.6y no no no telephone may mon nonexistent no
4 services married high.school no no yes telephone may mon nonexistent no
In [7]:
#Seleccionar datos nĆŗmericos
df_numericos = df[["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]]
df_numericos.head()
Out[7]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 56 261 1 999 0 1.1 93.994 -36.4 4.857 5191.0
1 57 149 1 999 0 1.1 93.994 -36.4 4.857 5191.0
2 37 226 1 999 0 1.1 93.994 -36.4 4.857 5191.0
3 40 151 1 999 0 1.1 93.994 -36.4 4.857 5191.0
4 56 307 1 999 0 1.1 93.994 -36.4 4.857 5191.0
In [8]:
# Documentar sns.countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html

cat_vars = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome"]

#Crear figuras con subplots
fig, axs = plt.subplots(nrows=2, ncols=5, figsize = (20, 10))
axs = axs.flatten()

#Crear un countplot para cada variable categorica 
for i, var in enumerate (cat_vars):
    sns.countplot(x=var, hue="y", data = df_categoricos, ax=axs[i])
    axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation = 90)
    
#Ajustar espacio entre subplots
fig.tight_layout()

#Mostrar el plot
plt.show()
In [9]:
# Documentar seaborn.histplot: https://seaborn.pydata.org/generated/seaborn.histplot.html

#Lista  de variables categoricas 
cat_vars = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome"]

#Crear figuras con subplots
fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

#Crear histogramas para cada variable categorica 
for i, var in enumerate (cat_vars):
    sns.histplot(x=var, hue="y", data = df_categoricos, ax=axs[i], multiple = "fill", kde = False, element = "bars", fill= True, stat = "density")
    axs[i].set_xticklabels(df_categoricos[var].unique(), rotation=90)
    axs[i].set_xlabel(var)
    
#Ajustar el especio entre subplots
fig.tight_layout()

#Mostrar el plot
plt.show()
  1. La mayoría de las personas que suscriben los depósitos bancarios a plazo son: jubilados y estudiantes.

  2. La mayoría de las personas que suscriben los depósitos bancarios a plazo son cantactados por vía celular.

  3. La mayoría de las personas que suscriben los depósitos bancarios a plazo tienen su último contacto en: octubre, diciembre, marzo, septiembre

  4. La mayoría de las personas que suscriben el depósito bancario a plazo han valorado exitosamente la campaña de marketing.

Resaltar algunos aspectos de la distribución de los datos¶

In [10]:
# Documentar sns.boxplot: https://seaborn.pydata.org/generated/seaborn.boxplot.html

num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.boxplot(x=var, data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [11]:
# Documentar sns.violinplot: https://seaborn.pydata.org/generated/seaborn.violinplot.html

num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.violinplot(x=var, data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [12]:
num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.violinplot(x=var, y="y", data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [13]:
num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.histplot(x=var, data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [14]:
num_vars = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 
            'cons.conf.idx', 'euribor3m', 'nr.employed']

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars):
    sns.histplot(x=var, hue='y', data=df, ax=axs[i], multiple="stack")

fig.tight_layout()

plt.show()
In [15]:
# Documentar sns.pairplot: https://seaborn.pydata.org/generated/seaborn.pairplot.html

# Listado de variables nĆŗmericas. 
num_vars = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 
            'cons.conf.idx', 'euribor3m', 'nr.employed']

# Crear una Matriz para los diagramas de dispersión
sns.pairplot(df, hue='y')
Out[15]:
<seaborn.axisgrid.PairGrid at 0x136b31e06a0>

Procesamiento de los datos¶

Utilizar pandas.Series.unique¶

Devuelve valores Ćŗnicos de una serie de objetos.

Los valor únicos se devuelven en orden de aparición. Los valores Únicos se basan en tablas hash, por lo tanto, NO se ordenan.

https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html

In [16]:
df['job'].unique()
Out[16]:
array(['housemaid', 'services', 'admin.', 'blue-collar', 'technician',
       'retired', 'management', 'unemployed', 'self-employed', 'unknown',
       'entrepreneur', 'student'], dtype=object)
In [17]:
df['marital'].unique()
Out[17]:
array(['married', 'single', 'divorced', 'unknown'], dtype=object)
In [18]:
df['education'].unique()
Out[18]:
array(['basic.4y', 'high.school', 'basic.6y', 'basic.9y',
       'professional.course', 'unknown', 'university.degree',
       'illiterate'], dtype=object)
In [19]:
df['default'].unique()
Out[19]:
array(['no', 'unknown', 'yes'], dtype=object)
In [20]:
df['housing'].unique()
Out[20]:
array(['no', 'yes', 'unknown'], dtype=object)
In [21]:
df['loan'].unique()
Out[21]:
array(['no', 'yes', 'unknown'], dtype=object)
In [22]:
df['contact'].unique()
Out[22]:
array(['telephone', 'cellular'], dtype=object)
In [23]:
df['month'].unique()
Out[23]:
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'mar', 'apr',
       'sep'], dtype=object)
In [24]:
df['day_of_week'].unique()
Out[24]:
array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object)
In [25]:
df['poutcome'].unique()
Out[25]:
array(['nonexistent', 'failure', 'success'], dtype=object)
In [26]:
df['y'].unique()
Out[26]:
array(['no', 'yes'], dtype=object)

Transformar datos¶

El paquete sklearn.preprocessing proporciona varias funciones comunes que son de utilidad en la transformación de clases para cambiar los vectores de características en una representación que sea más adecuada para los estimadores posteriores.¶

En general, los algoritmos de aprendizaje se benefician de la estandarización del conjunto de datos.¶

https://scikit-learn.org/stable/modules/preprocessing.html

In [27]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['job']= label_encoder.fit_transform(df['job'])
df['job'].unique()
Out[27]:
array([ 3,  7,  0,  1,  9,  5,  4, 10,  6, 11,  2,  8])
In [28]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['marital']= label_encoder.fit_transform(df['marital'])
df['marital'].unique()
Out[28]:
array([1, 2, 0, 3])
In [29]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['education']= label_encoder.fit_transform(df['education'])
df['education'].unique()
Out[29]:
array([0, 3, 1, 2, 5, 7, 6, 4])
In [30]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['default']= label_encoder.fit_transform(df['default'])
df['default'].unique()
Out[30]:
array([0, 1, 2])
In [31]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['housing']= label_encoder.fit_transform(df['housing'])
df['housing'].unique()
Out[31]:
array([0, 2, 1])
In [32]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['loan']= label_encoder.fit_transform(df['loan'])
df['loan'].unique()
Out[32]:
array([0, 2, 1])
In [33]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['contact']= label_encoder.fit_transform(df['contact'])
df['contact'].unique()
Out[33]:
array([1, 0])
In [34]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['month']= label_encoder.fit_transform(df['month'])
df['month'].unique()
Out[34]:
array([6, 4, 3, 1, 8, 7, 2, 5, 0, 9])
In [35]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['day_of_week']= label_encoder.fit_transform(df['day_of_week'])
df['day_of_week'].unique()
Out[35]:
array([1, 3, 4, 2, 0])
In [36]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['poutcome']= label_encoder.fit_transform(df['poutcome'])
df['poutcome'].unique()
Out[36]:
array([1, 0, 2])
In [37]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['y']= label_encoder.fit_transform(df['y'])
df['y'].unique()
Out[37]:
array([0, 1])
In [38]:
df.head()
Out[38]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 3 1 0 0 0 0 1 6 1 261 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
1 57 7 1 3 1 0 0 1 6 1 149 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
2 37 7 1 3 0 2 0 1 6 1 226 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
3 40 0 1 1 0 0 0 1 6 1 151 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
4 56 7 1 3 0 0 2 1 6 1 307 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0

Balancear las etiquetas:¶

"Y" Label

Insertar un gráfico de barras para mostrar los conteos de observaciones en cada categeria usando barras.¶

In [39]:
sns.countplot(df['y'])
df['y'].value_counts()
Out[39]:
0    36548
1     4640
Name: y, dtype: int64

Usar función resample https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html¶

In [40]:
from sklearn.utils import resample
#Crear dos diferentes dataframe de una clase mayoritaria y minoritaria
df_majority = df[(df['y']==0)] 
df_minority = df[(df['y']==1)] 

# muestreo ascendente de la clase minoritaria
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # muesta con reemplazo 
                                 n_samples= 36548, # para que coincida con la clase mayoritaria
                                 random_state=0)   # resultados reproducible

# Combinar la clase mayoritaria con la muestra ascendente de la clase minoritaria 
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
In [41]:
sns.countplot(df_upsampled['y'])
df_upsampled['y'].value_counts()
Out[41]:
1    36548
0    36548
Name: y, dtype: int64

Eliminar outliers usando IQR¶

Detectar outlier es tedioso, especialmente cuando se tienen multiples tipos de datos.

Por lo tanto, tenemos diferentes formas de detectar valores atĆ­picos para diferentes tipos de datos.

En cuanto a los datos distribuidos normalmente, podemos obtener el mƩtodo Z-Score;

Para skewed data, se usa IQR.

IQR es la diferencia entre el cuartil 75th and 25th.¶

In [42]:
def remove_outliers_iqr(df, columns):
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

# SeƱale las columnas para remover los outliers
columns_to_check = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 
                    'euribor3m', 'nr.employed']

# Solicitar la función que remueve los outliers usando IQR
df_clean = remove_outliers_iqr(df_upsampled, columns_to_check)

# Mostrar el resultado en el dataframe
df_clean.head()
Out[42]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
37017 25 8 2 7 1 2 0 0 3 3 371 1 999 0 1 -2.9 92.469 -33.6 1.044 5076.2 1
36682 51 9 2 6 0 0 0 0 4 0 657 1 999 0 1 -2.9 92.963 -40.8 1.268 5076.2 1
29384 45 7 2 7 0 0 0 1 0 0 541 1 999 0 1 -1.8 93.075 -47.1 1.405 5099.1 1
21998 29 9 2 3 1 0 0 0 1 4 921 3 999 0 1 1.4 93.444 -36.1 4.964 5228.1 1
16451 37 10 2 2 1 2 2 0 3 4 633 1 999 0 1 1.4 93.918 -42.7 4.963 5228.1 1
In [43]:
df_clean.shape
Out[43]:
(49702, 21)

Correlación mostrando un heatmap¶

Seaborn es una biblioteca de python que permite hacer mejores grÔficos fÔcilmente gracias a su función heatmap(). Un mapa de calor es una representación grÔfica de datos donde cada valor de una matriz se representa como un color.

https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [44]:
plt.figure(figsize=(20, 16))
sns.heatmap(df_clean.corr(), fmt='.2g', annot=True)
Out[44]:
<AxesSubplot:>

Definiendo vector de características (X) y variable target (y)¶

In [45]:
X = df_clean.drop('y', axis=1)
y = df_clean['y']

Dividir arrais o matrices en subconjuntos aleatorios de entrenamiento y prueba.¶

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Para ser precisos, el mƩtodo split() genera los ƭndices de entrenamiento y prueba, no los datos en si mismos.

Tener mĆŗltiples divisiones puede ser Ćŗtil si desea estimar mejor el rendimiento de su modelo.

In [46]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=0)

Instanciar el modelo¶

In [47]:
from sklearn.svm import SVC
svc = SVC("c=1", "gamma=1") 
svc.fit(X_train, y_train)
Out[47]:
SVC()
In [48]:
y_pred = svc.predict(X_test)

print('Precisión en el set de Entrenamiento: {:.2f}'
     .format(svc.score(X_train, y_train)))
print('Precisión en el set de Test: {:.2f}'
     .format(svc.score(X_test, y_test)))
Precisión en el set de Entrenamiento: 0.83
Precisión en el set de Test: 0.83
In [49]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, jaccard_score
print('F-1 Score : ',(f1_score(y_test, y_pred, average='micro')))
print('Precision Score : ',(precision_score(y_test, y_pred, average='micro')))
print('Recall Score : ',(recall_score(y_test, y_pred, average='micro')))
print('Jaccard Score : ',(jaccard_score(y_test, y_pred, average='micro')))
F-1 Score :  0.8321373482663805
Precision Score :  0.8321373482663805
Recall Score :  0.8321373482663805
Jaccard Score :  0.7125301481566556
In [50]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
print (classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.85      0.87      0.86      8875
           1       0.80      0.78      0.79      6036

    accuracy                           0.83     14911
   macro avg       0.83      0.82      0.83     14911
weighted avg       0.83      0.83      0.83     14911

In [51]:
# Matriz de confusión
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], 
                                 index=['Predict Positive:1', 'Predict Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Reds')
plt.figure(figsize=(9,9))
Out[51]:
<Figure size 900x900 with 0 Axes>
<Figure size 900x900 with 0 Axes>

Obtener C Grid Search¶

In [52]:
from sklearn.model_selection import GridSearchCV
param_grid = {"C":[0.1, 1, 10, 100, 1000], "gamma":[1, 0.1, 0.01, 0.001, 0.0001]}
In [53]:
grid = GridSearchCV(SVC(), param_grid, verbose=2)
grid.fit(X_train, y_train)
Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] END .....................................C=0.1, gamma=1; total time= 5.6min
[CV] END .....................................C=0.1, gamma=1; total time= 5.8min
[CV] END .....................................C=0.1, gamma=1; total time= 5.8min
[CV] END .....................................C=0.1, gamma=1; total time= 5.7min
[CV] END .....................................C=0.1, gamma=1; total time= 5.8min
[CV] END ...................................C=0.1, gamma=0.1; total time= 4.7min
[CV] END ...................................C=0.1, gamma=0.1; total time= 5.2min
[CV] END ...................................C=0.1, gamma=0.1; total time= 5.3min
[CV] END ...................................C=0.1, gamma=0.1; total time= 5.2min
[CV] END ...................................C=0.1, gamma=0.1; total time= 5.3min
[CV] END ..................................C=0.1, gamma=0.01; total time= 1.8min
[CV] END ..................................C=0.1, gamma=0.01; total time= 1.8min
[CV] END ..................................C=0.1, gamma=0.01; total time= 1.8min
[CV] END ..................................C=0.1, gamma=0.01; total time= 1.8min
[CV] END ..................................C=0.1, gamma=0.01; total time= 1.8min
[CV] END .................................C=0.1, gamma=0.001; total time=  47.7s
[CV] END .................................C=0.1, gamma=0.001; total time=  48.5s
[CV] END .................................C=0.1, gamma=0.001; total time=  48.2s
[CV] END .................................C=0.1, gamma=0.001; total time=  47.8s
[CV] END .................................C=0.1, gamma=0.001; total time=  47.6s
[CV] END ................................C=0.1, gamma=0.0001; total time=  46.5s
[CV] END ................................C=0.1, gamma=0.0001; total time=  43.9s
[CV] END ................................C=0.1, gamma=0.0001; total time=  42.1s
[CV] END ................................C=0.1, gamma=0.0001; total time=  43.5s
[CV] END ................................C=0.1, gamma=0.0001; total time=  45.9s
[CV] END .......................................C=1, gamma=1; total time=30.5min
[CV] END .......................................C=1, gamma=1; total time= 5.6min
[CV] END .......................................C=1, gamma=1; total time= 6.3min
[CV] END .......................................C=1, gamma=1; total time= 6.6min
[CV] END .......................................C=1, gamma=1; total time= 7.1min
[CV] END .....................................C=1, gamma=0.1; total time= 5.4min
[CV] END .....................................C=1, gamma=0.1; total time= 5.6min
[CV] END .....................................C=1, gamma=0.1; total time= 3.6min
[CV] END .....................................C=1, gamma=0.1; total time= 3.2min
[CV] END .....................................C=1, gamma=0.1; total time= 3.3min
[CV] END ....................................C=1, gamma=0.01; total time= 1.3min
[CV] END ....................................C=1, gamma=0.01; total time= 1.3min
[CV] END ....................................C=1, gamma=0.01; total time= 1.3min
[CV] END ....................................C=1, gamma=0.01; total time= 1.3min
[CV] END ....................................C=1, gamma=0.01; total time= 1.2min
[CV] END ...................................C=1, gamma=0.001; total time=  38.2s
[CV] END ...................................C=1, gamma=0.001; total time=  33.8s
[CV] END ...................................C=1, gamma=0.001; total time=  32.7s
[CV] END ...................................C=1, gamma=0.001; total time=  34.5s
[CV] END ...................................C=1, gamma=0.001; total time=  37.8s
[CV] END ..................................C=1, gamma=0.0001; total time=  28.3s
[CV] END ..................................C=1, gamma=0.0001; total time=  28.1s
[CV] END ..................................C=1, gamma=0.0001; total time=  28.3s
[CV] END ..................................C=1, gamma=0.0001; total time=  28.4s
[CV] END ..................................C=1, gamma=0.0001; total time=  29.8s
[CV] END ......................................C=10, gamma=1; total time= 4.5min
[CV] END ......................................C=10, gamma=1; total time= 3.9min
[CV] END ......................................C=10, gamma=1; total time= 3.8min
[CV] END ......................................C=10, gamma=1; total time= 4.4min
[CV] END ......................................C=10, gamma=1; total time= 4.3min
[CV] END ....................................C=10, gamma=0.1; total time= 3.1min
[CV] END ....................................C=10, gamma=0.1; total time= 3.2min
[CV] END ....................................C=10, gamma=0.1; total time= 4.9min
[CV] END ....................................C=10, gamma=0.1; total time= 6.7min
[CV] END ....................................C=10, gamma=0.1; total time=18.5min
[CV] END ...................................C=10, gamma=0.01; total time= 4.1min
[CV] END ...................................C=10, gamma=0.01; total time= 4.2min
[CV] END ...................................C=10, gamma=0.01; total time= 4.1min
[CV] END ...................................C=10, gamma=0.01; total time= 4.1min
[CV] END ...................................C=10, gamma=0.01; total time= 4.2min
[CV] END ..................................C=10, gamma=0.001; total time= 6.2min
[CV] END ..................................C=10, gamma=0.001; total time= 1.5min
[CV] END ..................................C=10, gamma=0.001; total time= 1.0min
[CV] END ..................................C=10, gamma=0.001; total time= 1.1min
[CV] END ..................................C=10, gamma=0.001; total time= 1.0min
[CV] END .................................C=10, gamma=0.0001; total time=  50.6s
[CV] END .................................C=10, gamma=0.0001; total time=  50.4s
[CV] END .................................C=10, gamma=0.0001; total time=  49.4s
[CV] END .................................C=10, gamma=0.0001; total time=  49.9s
[CV] END .................................C=10, gamma=0.0001; total time=  50.8s
[CV] END .....................................C=100, gamma=1; total time= 6.9min
[CV] END .....................................C=100, gamma=1; total time= 6.1min
[CV] END .....................................C=100, gamma=1; total time= 5.9min
[CV] END .....................................C=100, gamma=1; total time= 6.6min
[CV] END .....................................C=100, gamma=1; total time= 6.7min
[CV] END ...................................C=100, gamma=0.1; total time= 4.9min
[CV] END ...................................C=100, gamma=0.1; total time= 4.9min
[CV] END ...................................C=100, gamma=0.1; total time= 4.9min
[CV] END ...................................C=100, gamma=0.1; total time= 4.9min
[CV] END ...................................C=100, gamma=0.1; total time= 4.9min
[CV] END ..................................C=100, gamma=0.01; total time= 3.1min
[CV] END ..................................C=100, gamma=0.01; total time= 3.1min
[CV] END ..................................C=100, gamma=0.01; total time= 3.1min
[CV] END ..................................C=100, gamma=0.01; total time= 3.1min
[CV] END ..................................C=100, gamma=0.01; total time= 3.1min
[CV] END .................................C=100, gamma=0.001; total time= 2.1min
[CV] END .................................C=100, gamma=0.001; total time= 1.7min
[CV] END .................................C=100, gamma=0.001; total time= 1.8min
[CV] END .................................C=100, gamma=0.001; total time= 1.8min
[CV] END .................................C=100, gamma=0.001; total time= 1.9min
[CV] END ................................C=100, gamma=0.0001; total time= 1.4min
[CV] END ................................C=100, gamma=0.0001; total time= 1.5min
[CV] END ................................C=100, gamma=0.0001; total time= 1.4min
[CV] END ................................C=100, gamma=0.0001; total time= 1.3min
[CV] END ................................C=100, gamma=0.0001; total time= 1.3min
[CV] END ....................................C=1000, gamma=1; total time= 6.9min
[CV] END ....................................C=1000, gamma=1; total time= 6.1min
[CV] END ....................................C=1000, gamma=1; total time= 5.9min
[CV] END ....................................C=1000, gamma=1; total time= 8.3min
[CV] END ....................................C=1000, gamma=1; total time= 6.8min
[CV] END ..................................C=1000, gamma=0.1; total time= 4.9min
[CV] END ..................................C=1000, gamma=0.1; total time= 4.9min
[CV] END ..................................C=1000, gamma=0.1; total time= 4.9min
[CV] END ..................................C=1000, gamma=0.1; total time= 5.0min
[CV] END ..................................C=1000, gamma=0.1; total time= 5.0min
[CV] END .................................C=1000, gamma=0.01; total time= 3.3min
[CV] END .................................C=1000, gamma=0.01; total time= 3.8min
[CV] END .................................C=1000, gamma=0.01; total time= 3.4min
[CV] END .................................C=1000, gamma=0.01; total time= 3.1min
[CV] END .................................C=1000, gamma=0.01; total time= 3.3min
[CV] END ................................C=1000, gamma=0.001; total time= 4.2min
[CV] END ................................C=1000, gamma=0.001; total time= 3.9min
[CV] END ................................C=1000, gamma=0.001; total time= 4.0min
[CV] END ................................C=1000, gamma=0.001; total time= 4.3min
[CV] END ................................C=1000, gamma=0.001; total time= 4.3min
[CV] END ...............................C=1000, gamma=0.0001; total time= 3.9min
[CV] END ...............................C=1000, gamma=0.0001; total time= 3.9min
[CV] END ...............................C=1000, gamma=0.0001; total time= 3.8min
[CV] END ...............................C=1000, gamma=0.0001; total time= 3.9min
[CV] END ...............................C=1000, gamma=0.0001; total time= 4.2min
Out[53]:
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
             verbose=2)
In [57]:
grid.best_params_
Out[57]:
{'C': 1, 'gamma': 1}

SVM Funciones Kernel¶

Modelo SVM usando Linear Kernel function¶

In [58]:
linear_classifier = SVC(kernel='linear').fit(X_train,y_train)
y_pred = linear_classifier.predict(X_test)
print('Model accuracy with linear kernel : {0:0.3f}'. format(accuracy_score(y_test, y_pred)))
Model accuracy with linear kernel : 0.826
In [59]:
# Matriz de confusión para SVM Linear Kernel
cm = confusion_matrix(y_test, y_pred)
cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], 
                                 index=['Predict Positive:1', 'Predict Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='mako')
Out[59]:
<AxesSubplot:>
In [60]:
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.93      0.77      0.84      8875
           1       0.73      0.91      0.81      6036

    accuracy                           0.83     14911
   macro avg       0.83      0.84      0.82     14911
weighted avg       0.85      0.83      0.83     14911

In [61]:
# Modelo SVM usando la función kernel Gaussian RBF
In [64]:
rbf_svc=SVC(kernel='rbf').fit(X_train,y_train)
y_pred = rbf_svc.predict(X_test)
print('Model accuracy with rbf kernel : {0:0.3f}'. format(accuracy_score(y_test, y_pred)))
Model accuracy with rbf kernel : 0.832
In [65]:
# Matriz de confusión Gaussian RBF
cm = confusion_matrix(y_test, y_pred)
cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], 
                                 index=['Predict Positive:1', 'Predict Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='mako')
Out[65]:
<AxesSubplot:>
In [66]:
# Modelo SVM usando la función kernel Polynomial
In [68]:
Poly_svc=SVC(kernel='poly', C=1).fit(X_train,y_train)
y_pred = Poly_svc.predict(X_test)
print('Model accuracy with poly kernel : {0:0.3f}'. format(accuracy_score(y_test, y_pred)))
Model accuracy with poly kernel : 0.836
In [69]:
# Matriz de confusión Polynomial
cm = confusion_matrix(y_test, y_pred)
cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], 
                                 index=['Predict Positive:1', 'Predict Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='mako')
Out[69]:
<AxesSubplot:>
In [70]:
# Modelo SVM usando la función kernel Sigmoid
In [71]:
Sig_svc=SVC(kernel='sigmoid', C=1).fit(X_train,y_train)
y_pred = Sig_svc.predict(X_test)
print('Model accuracy with sigmoid kernel : {0:0.3f}'. format(accuracy_score(y_test, y_pred)))
Model accuracy with sigmoid kernel : 0.705
In [72]:
# Matriz de confusión Sigmoid
cm = confusion_matrix(y_test, y_pred)
cm_matrix = pd.DataFrame(data=cm, columns=['Actual Positive:1', 'Actual Negative:0'], 
                                 index=['Predict Positive:1', 'Predict Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='mako')
Out[72]:
<AxesSubplot:>

La función Kernel que mejor ajusta es Polynomial Kernel function¶

Fin¶